Abstract. Open-ended constructed responses promote deeper processing of
course materials. Further, evaluation of these responses can yield important
information about students’ cognition. This study examined how students’
constructed responses, generated at different points during learning, relate to
their later comprehension outcomes. College students (N = 75) produced
self-explanations during reading and explanatory retrievals after reading. The
Constructed Response Analysis Tool (CRAT) was used to analyze these
responses across multiple dimensions of language and relate these textual
features to comprehension performance. Results indicate that the linguistic
features of post-reading explanatory retrievals were more predictive of
comprehension outcomes than those of self-explanations. Further, the two sets
of models relied on different indices to predict performance.
1 Introduction


Learning from text is a critical skill, but many students struggle with
content-based reading [1]. Prompting students to generate constructed responses
(e.g., verbal protocols, summaries) is beneficial because it encourages active
processing [2, 3]. These responses can also serve as “stealth assessments”
[4, 5] of in situ learning that continually update a learner model and drive
feedback without waiting for more formal checkpoint quizzes or module exams.

In the current study, we explore the use of explanatory retrieval prompts as stealth
assessments. Explanatory retrievals are a type of constructed response in which
students explain, from memory, what they have just read. As an elaborative or
constructive version of retrieval practice, explanatory retrieval may yield superior
comprehension as compared to free recall prompts or completing multiple-choice or
fill-in-the-blank tests [6, 7]. Not only is this approach effective, but it is also
practical: asking students to “explain what you have just read about [topic]” rather
than to answer a series of quiz questions reduces the need for instructors or
instructional designers to generate numerous items. Finally, these activities may
have value as stealth assessments that can track students’ learning processes and
progress.

Although explanatory retrievals are beneficial for learning, they are often 
underutilized in the classroom due to the arduous nature of scoring open-ended 
responses [8]. Fortunately, natural language processing (NLP) tools have afforded an 
increased use of constructed responses within educational technologies [9, 10]. NLP 
analyses can be used to automate scoring and provide targeted feedback for a variety of 
constructed responses including think-alouds [11], self-explanations [12], summaries 
[13, 14], and essays [15]. Notably, the indices implicated in these analyses vary across 
constructed response type, presumably because they reflect different strategies and 
cognitive processes. Taken together, this research demonstrates the potential for
analyzing explanatory retrievals as a mode of stealth assessment, but also highlights
the need to consider how explanatory retrievals might differ from other forms of
constructed response.

Thus, in the current study, we examine how linguistic features of explanatory 
retrievals (ERs) relate to comprehension test performance. We also examine how ERs 
compare with another type of constructed response, self-explanation (SE), for which 
linguistic features have been studied. Prior research guides two primary
hypotheses: 1) the linguistic features of the responses will provide information
predictive of subsequent comprehension test performance, and 2) the features of ERs
that predict comprehension performance will differ from the predictive features of
SEs. In other words, as a retrieval (i.e., memory-based) process, post-reading ER may
bring to bear different strategies and processes than those found in concurrent SEs.


2 Method 


2.1 Design and Procedure 


College students (N = 75; mean age = 25.04; 72% female; 13% ESL) read two science
texts. At nine points in each text, students were directed to generate an SE. After
reading, participants were prompted to produce an ER. The instructions specified that
the goal was not simply to recall as much as possible, but to provide a coherent
explanation of the information in the text. After reading and explaining both texts,
participants completed a multiple-choice comprehension test for each text. Each test
included four memory items and four inference items.


2.2 Data Processing 


SEs were combined to create an “aggregated SE” for each text [16-18]. These
aggregated SEs and the ERs were submitted to the Constructed Response Analysis
Tool (CRAT) [19]. CRAT calculates more than 700 indices related to 1) similarities
(keyword overlap, latent semantic analysis) between a source text and a constructed
response and 2) lexical sophistication and text properties. After the SEs and ERs had
been analyzed by CRAT, the set of indices was reduced based on multicollinearity and
relation to the dependent variable. Specifically, when two indices were highly
collinear (r > .70), only the index more strongly related to the dependent variable
was retained. Additionally, indices that exhibited a weak or absent relationship with
the dependent variable (r < .10) were removed. After this process, 50 CRAT indices
remained for the machine learning analyses.
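
The two-step reduction described above can be sketched as follows. This is a minimal illustration in plain Python, not the pipeline used in the study (which was run in R). The function names `pearson_r` and `filter_indices`, the toy data, the use of absolute correlations, and the greedy order in which collinear indices are resolved are all assumptions made for the example.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant column: treat as uncorrelated
    return sxy / math.sqrt(sxx * syy)

def filter_indices(features, target, collinear=0.70, weak=0.10):
    """Return the names of the indices that survive both filters.

    features: dict mapping index name -> list of values (hypothetical layout)
    target:   list of comprehension scores, one per response
    """
    # Strength of each index's relation to the dependent variable.
    strength = {name: abs(pearson_r(col, target)) for name, col in features.items()}
    # Drop indices with a weak or absent relation to the dependent variable.
    candidates = [name for name in features if strength[name] >= weak]
    # Visit the strongest indices first; keep an index only if it is not
    # highly collinear with an index that has already been kept.
    candidates.sort(key=lambda name: strength[name], reverse=True)
    kept = []
    for name in candidates:
        if all(abs(pearson_r(features[name], features[k])) <= collinear for k in kept):
            kept.append(name)
    return kept

# Toy data (hypothetical values, not CRAT output).
scores = [1, 2, 3, 4, 5, 6]
toy = {
    "index_a": [1, 2, 3, 4, 5, 6],  # strongly related to the scores
    "index_b": [1, 2, 3, 4, 5, 7],  # nearly a copy of index_a
    "index_c": [2, 1, 3, 3, 1, 2],  # unrelated to the scores
    "index_d": [3, 1, 3, 1, 3, 1],  # modestly related, not collinear with index_a
}
print(filter_indices(toy, scores))
```

In the toy data, `index_b` is dropped as collinear with the stronger `index_a`, and `index_c` is dropped for its absent relation to the scores.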


2.3 Supervised Classification and Validation 


Supervised machine learning techniques were used to predict students’ comprehension
scores. The caret package for R [20] was used to train Linear Regression, Support
Vector Machine (SVM), and Random Forest models. All models were evaluated using
leave-one-out cross-validation (LOOCV), in which k − 1 instances were used as the
training set and the model was tested on the one instance held out. This process was
repeated k times, until each instance had served as the test set. LOOCV provides an
estimate of how well the models generalize to new data.
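
The LOOCV procedure described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the study's implementation: a single-predictor least-squares line stands in for the caret models, the helper names (`fit_line`, `loocv_predictions`, `rmse`, `r_squared`) and the toy data are hypothetical, and R² is computed as the squared correlation between observed and held-out predicted values, which is one common convention.

```python
import math

def fit_line(x, y):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx if sxx else 0.0
    return my - slope * mx, slope

def loocv_predictions(x, y):
    """Hold out each instance once, train on the other k - 1, predict the held-out one."""
    preds = []
    for i in range(len(x)):
        train_x = x[:i] + x[i + 1:]
        train_y = y[:i] + y[i + 1:]
        intercept, slope = fit_line(train_x, train_y)
        preds.append(intercept + slope * x[i])
    return preds

def rmse(obs, pred):
    """Root mean squared error between observed and predicted values."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))

def r_squared(obs, pred):
    """Squared Pearson correlation between observed and predicted values."""
    n = len(obs)
    mo, mp = sum(obs) / n, sum(pred) / n
    sop = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    soo = sum((o - mo) ** 2 for o in obs)
    spp = sum((p - mp) ** 2 for p in pred)
    if soo == 0 or spp == 0:
        return 0.0  # undefined correlation: report zero
    return (sop * sop) / (soo * spp)

# Toy data (hypothetical): one linguistic index and a comprehension score.
toy_x = [0.2, 0.5, 0.9, 1.4, 1.8, 2.3]
toy_y = [3.0, 4.0, 4.5, 6.0, 6.5, 8.0]
held_out = loocv_predictions(toy_x, toy_y)
print("RMSE:", rmse(toy_y, held_out))
print("R^2:", r_squared(toy_y, held_out))
```

Both metrics are computed on the pooled held-out predictions, so every prediction comes from a model that never saw that instance.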


3 Results 


On average, students’ aggregated SEs contained 172.69 (SD = 99.75) words, whereas 
their ERs contained 90.97 (SD = 45.52) words. Word count was included as a control 
variable in our models; however, it was not an important feature of any of the models. 

The response types (SE, ER) were tested independently using the same regression
algorithms (Linear Regression, SVM, Random Forest). A summary of model accuracies
is presented in Table 1. Overall, the SVM yielded the highest R² for both SE and ER
data. The CRAT indices accounted for 15% (SE) and 25% (ER) of variance in
comprehension scores, suggesting that the properties of the retrievals were more
informative of students’ comprehension of text content.


Table 1. Description of model accuracy. 


                     Self-Explanation (SE)    Explanatory Retrieval (ER)
Algorithm            RMSE      R²             RMSE      R²
Linear Regression    1.97      0.04           1.76      0.12
SVM (Polynomial)     1.67      0.15           1.52      0.25
Random Forest        1.67      0.13           1.50      0.24


To more closely examine the CRAT indices driving the model predictions, we
examined the scaled variable importance of indices in the SVM models. Four of the top
five variables in the SE model were adjective keyword indices from the COCA corpus:
academic, magazine, fiction, and news adjective keywords. The remaining variable was
academic bigram keywords. In comparison, the top five variables in the ER model were
academic bigram keywords, word imageability, academic keywords, age of acquisition
for content words, and fiction keywords. These results indicate that the descriptive
content (i.e., adjectives) of the SEs was most predictive of comprehension scores,
whereas predictions from the ERs drew on a wider variety of textual information,
particularly lexical sophistication.


4 Discussion 


This study examined the potential of explanatory retrievals (ERs) to serve as a form of 
stealth assessment of reading comprehension performance. Given that open-ended 
retrieval attempts can vary widely in quality [7], automating the evaluation of ER 
practice can make it more feasible to include ER tasks in the classroom. This study
demonstrated modest but promising results. In particular, our best model (SVM
Polynomial) accounted for 15% and 25% of the variance using the properties of SEs
and ERs, respectively. These results support the extant work demonstrating that natural
language processing techniques can be used to model important comprehension
processes [11-15].

A more novel finding in the present study is that, as predicted, different types of
constructed responses were not uniformly related to reading comprehension
performance. That is, models of SE responses and ER responses relied on some different
features to predict comprehension and did so with different degrees of success. This
supports the idea that different constructed responses influence and predict
comprehension in different ways. Further work will more closely examine these
linguistic features in context to understand why particular features are more or less
predictive in a particular type of response and how the underlying processes impact
different aspects of learning (i.e., memory vs. inference and application). The goal
of this study was to compare types of constructed responses and examine how each
might provide different insights into learning processes. However, in future work, we
plan to leverage the unique contributions of both in a combined model in which
features of SEs and ERs are used to predict performance.

One limitation of note is that LOOCV was conducted at the item level, with the
same participants generating multiple items. As a result, each held-out item was not
fully independent of the training data. Further research with larger datasets will
examine how these models generalize to entirely independent datasets. In addition, this
study relied only on the CRAT tool to analyze linguistic features of the constructed 
responses. Existing work on analysis of constructed responses [15-18] suggests that 
our models will have higher accuracy if they include indices that characterize text 
across multiple dimensions (e.g., lexical, syntax, cohesion). Thus, future work will 
examine the value of employing additional linguistic analysis tools to account for 
variance in other dimensions of language. 
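
The item-level limitation noted above can be made concrete with a grouped split, in which all responses from one participant are held out together. This scheme was not used in the study; the sketch below only illustrates one way such a split could be built, and the function name `leave_one_group_out` and the participant labels are hypothetical.

```python
def leave_one_group_out(groups):
    """Yield (held_out_group, train_idx, test_idx) for each distinct group label.

    groups: list with one label per instance, e.g. a participant ID per response.
    """
    # Collect distinct labels in order of first appearance.
    seen = []
    for label in groups:
        if label not in seen:
            seen.append(label)
    for held_out in seen:
        test_idx = [i for i, label in enumerate(groups) if label == held_out]
        train_idx = [i for i, label in enumerate(groups) if label != held_out]
        yield held_out, train_idx, test_idx

# Hypothetical layout: each participant contributes one response per text.
participants = ["p1", "p1", "p2", "p2", "p3", "p3"]
for pid, train_idx, test_idx in leave_one_group_out(participants):
    print(pid, "train:", train_idx, "test:", test_idx)
```

With this layout, no participant contributes to both the training set and the test set within a fold.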

Overall, the results of this study suggest that ERs can serve as both powerful 
learning activities and as assessments of developing comprehension. However, more 
work is needed to improve and refine automated procedures for scoring and providing 
feedback based on these responses. The ultimate goal of this research is to use these 
linguistic indices to facilitate nuanced assessments of constructed responses that can 
drive improved formative feedback and personalization in educational technologies. 